CS 229 Project: A Machine Learning Framework for Biochemical Reaction Matching

Authors

  • Tomer Altman
  • Eric Burkhart
  • Irene M. Kaplow
  • Ryan Thompson
Abstract

Biochemical reaction databases capture the sum of human knowledge of biochemical reactions and chemical compounds. As the amount of data available on metabolic reactions and chemical substrates increases, so does the need for central repositories. Unfortunately, there is no established algorithm for calculating the degree of similarity between reactions in different databases. A set of features was defined and calculated for all pairs of reactions between the KEGG and MetaCyc reaction databases. Features include reaction name match, Tanimoto coefficient of the reactions in stoichiometric vector form, enzyme identifier matching, and Enzyme Commission classification differences. Logistic regression, non-linear SVM, naïve Bayes, and decision tree learning methods were implemented, and feature selection, cross-validation, k-means clustering, and other debugging methods were applied to determine how to improve the algorithms and data. We conclude that decision trees and logistic regression are the most accurate methods, and that the Tanimoto coefficient is a key feature for the performance of both.
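
As an illustration of the Tanimoto feature mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes each reaction is stored as a sparse stoichiometric vector (a dict from compound name to signed coefficient, negative for consumed and positive for produced) and uses the common real-valued form T(a, b) = a·b / (|a|² + |b|² − a·b). The function name `tanimoto` and the two example reactions are hypothetical.

```python
def tanimoto(a, b):
    """Real-valued Tanimoto coefficient between two sparse vectors.

    `a` and `b` are dicts mapping a compound identifier to its signed
    stoichiometric coefficient (an assumed representation).
    """
    dot = sum(coef * b.get(compound, 0.0) for compound, coef in a.items())
    norm_a = sum(coef * coef for coef in a.values())
    norm_b = sum(coef * coef for coef in b.values())
    denom = norm_a + norm_b - dot
    # denom is zero only when both vectors are all zeros.
    return dot / denom if denom else 0.0


# Hypothetical example reactions (negative = consumed, positive = produced).
rxn_a = {"glucose": -1, "ATP": -1, "glucose-6-phosphate": 1, "ADP": 1}
rxn_b = {"glucose": -1, "ATP": -1, "glucose-6-phosphate": 1, "ADP": 1, "H+": 1}

if __name__ == "__main__":
    print(tanimoto(rxn_a, rxn_a))  # identical reactions
    print(tanimoto(rxn_a, rxn_b))  # differ only by one proton
```

In this sketch, two entries for the same reaction that differ only in whether a proton is listed still score close to 1, which is the kind of tolerance a cross-database matching feature needs. It also assumes compound identifiers have already been mapped to a shared namespace across the two databases.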

Similar resources

CS 229 Final Project Report: Speech & Noise Separation

In this course project I investigated machine learning approaches for separating speech signals from background noise. Keywords: MFCC, SVM, noise separation, source separation, spectrogram

Spectral Learning of General Latent-Variable Probabilistic Graphical Models: A Supervised Learning Approach

In this CS 229 project, I designed, proved and tested a new spectral learning algorithm for learning probabilistic graphical models with latent variables by reducing the hard learning problem into a pipeline of supervised learning tasks. This new algorithmic framework can provide us with more learning power by giving us the freedom to plug in all different kinds of supervised learning algorithm...

CS 229 Project Report: Learning Techniques to Aid Pose Estimation via SIFT (Sean Augenstein)

I am working on machine learning techniques to intelligently track and match features in a sequence of visual images. Specifically, the feature I am tracking is known as the Scale Invariant Feature Transform (SIFT). My project involves using a camera to capture images of the motion of robotic objects in my lab. I used PCA to find a small subset of the SIFT vectors that best match the object bet...

CS 229 Project Report: San Francisco Crime Classification

Different machine learning approaches were conceptualized and implemented for predicting the probabilities of crime categories for crimes reported in San Francisco. The crime records used in the research were downloaded from a competition on Kaggle. A Bayesian model, a mixture of Gaussians model (stratified and unstratified), and logistic regression were implemented. A satisfactory result was ac...

Publication date: 2010